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Abstract 



A new reorthogonalized block classical Gram-Schmidt algorithm is proposed that factorizes 
a full column rank matrix A into A = QR where Q is left orthogonal (has orthonormal columns) 
and R is upper triangular and nonsingular. 

With appropriate assumptions on the diagonal blocks of R, the algorithm, when implemented 
in floating point arithmetic with machine unit em, produces Q and R such that || / — Q'^Q\\2 = 
0{eM) and \\A — QR\\2 = 0(eM||^||2)- The resulting bounds also improve a previous bound by 
> ; Giraud et al. [Num. Math., 101(1):87-100, 2005] on the CGS2 algorithm originally developed 
^ by Abdelmalek [BIT, ll(4):354-367, 1971]. 
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^ ■ 

^ Introduction 

'^^r a matrix A G M*"^", m > n, we consider the computation of the Q-R decomposition 

. ; A = QR (1) 

where Q G M'"^'^ is left orthogonal (i.e., Q^Q = In) and R G M"^" is upper triangular. The matrix 
A is assumed to have full column rank. 

The approach considered is block classical Gram-Schmidt with reorthogonalization (BCGS2) which 
operates on groups of columns of A instead of columns in order to create a BLAS-3 ^ compatable 
algorithm. Thus we assume that A is partitioned into blocks A = {Ai, . . . , Ag), where each Ai has pi 
columns, i.e. Ai{m x pi) for i = 1, ... , s, with n = pi + p2 + . . . + Ps- 
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The block Gram-Schmidt algorithm that we present is a generalization of the classical Gram- 
Schmidt method with reorthogonalization (CGS2) which was first proposed and analyzed by Abdel- 
malek [1], but our analysis and development follow the flavor of that given by Giraud et al [5]. A 
similar block algorithm based upon CGS, justified only by numerical tests, is proposed by Stewart 
[T2] . Other block Gram-Schmidt algorithms are presented by Jalby and Phillippe [10] and Van- 
derstaeten [TS]. A goal of this paper is to establish the stability analysis results for BCGS2. As 
summarized in §3.3^ our analysis has implications for the algorithm in [H [S] and shows that the 
CGS2 algorithms produces a near orthogonal matrix under weaker assumptions than given in [5]. 
An excellent summary of the role of Gram-Schmidt algorithms is given in [3l §2.4, §3.2] and in the 
1994 survey paper [2]. 

We prove that BCGS2 is numerically stable under natural conditions outlined in §3] producing 
computed Q and R by BCGS2 in floating point arithmetic with machine unit Em that satisfy 

\\In-Q'^Qh < £A.//i(m,n,p) + 0(4,), (2) 
\\A-QR\\2 < eMf2{ni,n,p)\\A\\2 + 0{el,) (3) 

for modestly growing functions /i(-) and /2(-)) where p = maxi<j<sPi is the maximum block size of 
A. We assume that fj{m,n,p)eM < 1, j = 1,2, otherwise these bounds are meaningless. 

At the core of our block CGS method is a routine locaLqr, where for a matrix B e W^^^, 
p < n < m, produces 

[Q, R] = locaLqr(5) 

with 

Wlp-Q'^Qh < eMLi{m,p) <1, (4) 
\\B-QR\\2 < eMLiim,p)\\Bh (5) 

for some modest function Li{-). The routine locaLqr may be produced using Householder or Givens 
Q-R factorization. For appropriate BLAS-3 speed [1], that is, to take advantage of caching, the 
implementation of locaLqr may be done using the "tall, skinny" Q-R (TSQR) discussed in the 
recent Ph.D. thesis by Hoemmen [SI §2.3]. An interpretaton of [TJ §19.3] on the error analysis of 
Householder Q-R yield a function Li(m,p) = dimp^/'^ where di is a constant. If we make the 
assumptions in Theorem 13.41 for p = 1, then the CGS2 algorithm described by Giraud et al. [5] 
satisfies dH)-©. 

In ^ we present our algorithm, and in ^ we prove the properties ([2])- ([3]), followed by our 
conclusions in §H 

2 Reorthogonalized Block Gram— Schmidt 

First, we summarize the algorithm in [H [5]. Let A G M*"^*^ be given column-wise as 

A = (ai, . . . , a„). 
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To generate the decomposition ([T]) with R = (vij) and Q = (qi,...,q„), a step in the CGS2 
algorithm takes a near left orthogonal matrix U G M™^^, a vector b G M"^, and produces G M, 
Sfe G W and q?, G M"* such that 

cibn = {Im-UU'^fh, ||q6||2 = l, (6) 
b = f/Sfe + qferf,. (7) 

Ideally, q?, is orthogonal to the columns of f/, so ([6]) should be replaced by 

f/^qfe = 0, ||qb||2 = 1. (8) 

However, the condition ([8]) becomes hard to enforce as rf, approaches zero. Unlike the version of this 
procedure in [5], we scale the approximation to q?, at each step. That change has two benefits: (1) 
the function cgs2_step below is more resistant to underflow; (2) it leads to a natural generalization 
to a block algorithm. 

Function 2.1 [One step of CGS2] 

function [q^,, rf,, S;,] = cgs2-Step{U, b) 
= U^h; yi = b - Usi, 

ri = llyilh; qi = yi/n; 

S2 = t/^qi; y2 = qi - Us2; 

r2 = ||y2||2; Qb = y2/r2; 

Sf, = Si + S2ri; n = rari; 
end cgs2-step 

Notice that one step of CGS2 consists of exactly two steps of CGS. We have 

qiri = (J™ - f/ f/^)b, q,r2 = {In. - U f/^)qi, 

so, clearly, ([6]) holds. 

The CGS2 algorithm from lU |5] for computing the Q-R decomposition is stated next. 
Function 2.2 [Classical Gram— Schmidt with Reorthogonalization (CGS2)] 

function [Q, R] = cgs2{A) 
[m, n]=size{A); 
R=\\Ai:,l)h;Q = A{:,l)/R; 
for k = 2 : n 

[Qneui) f'newy Sneui]=Cgs2_Step(Q, A(^:, k))] 

-R = ( M ) ; Q = { Q ^.new ) ; 

y U ' new J 

end; 

end; cgs2 
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To obtain the new function block_CGS2_step, the block analog of cgs2_step, the function lo- 
caLqr substitutes for scaling the vectors. First we introduce the function block_CGS_step. Upon 
inputting B e W'p, U e M™''*, r + p < n < m, we produce Q E W^'p, R e and S e M*^p 

such that 

QR = {Im-UU'^)B, Q^Q = Ip, (9) 
B = US + QR. (10) 

Function 2.3 [One step of block CGS] 

function [Q, R, S] = block_CGS_step{U, B) 

S = U^B; 

Y = B-US; 

[Q,i?]=locaLqr(F); 

end block-CGS-step 



Since we assume that U and Q are near orthogonal in the sense that 

\\It-U^Uh < eMfi{m,t,p) + 0{el,)<l, (11) 
||/p-g^g||2 < SMLiim,p) + Oiel,), (12) 

a simple eigenvalue/singular value analysis yields the bounds 

mh < 1 + 0.5£M/i(m,t,p) + 0(4,) = 1 + 0(£m), 
IIQIh < l + 0.5eMLi{m,p) + O{elj) = l + O{eM) 

of which we will make generous use throughout our analysis. 

The behavior of this routine in floating point arithmetic is given by the next two lemmas. The 
proof of the first one is elementary, obvious, and will be skipped. 

Lemma 2.1 In floating point arithmetic with machine unit Em, Function 2.3 produces Q,R,S and 
Y such that for Li{-) defined in i^-iQ and for modestly growing functions L2{-) and L^[-) we have 



Ip-Q'^Qh < eMLi{m,p) (13) 

QR = Y + AY, ||AF||2 <£A/i^i(m,p)||5||2 (14) 

S + 6S = U^B, \\SSh<eML2{m,t,p)\\B\\2 + 0{el,) (15) 

Y + 6Y = B-US, \\6Yh<eMLs{t,p)\\B\\2 + 0{el) (16) 



If we use a standard matrix multiply and add routine, reasonable values for L2{m, t,p) and L^{t,p) 
are 

L2{m,t,p) = mt^'Y'\ Lsit^p) =p'/\l+t^/^). 
A second lemma yields a backward error bound. 
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Lemma 2.2 Assume the hypothesis and notation of Lemma \2.1\ then the computed Q and R from 
Function 2.3 satisfy 

QR ={Im-U U^)B + F (17) 



where 

Thus 

where 



F = AY -SY + UiSS). (18) 
\F\\2 < eMLFim,t,p)\\B\\2 + 0{elj), (19) 



LF{m,t,p) = Li{m,p) + L2{m,t,p) + L3{t,p). (20) 

Proof. We simply unwind the relationships from Lemma [2. II to obtain 

QR = Y + AY 

= B -US -5Y + AY 

= {Im-U U'^)B + U{6S) -6Y + AY 

= {Im-UU^)B + F 

which is f|T71) - f|T8l) . The use of standard norm inequalities yields f|T9|) - fl20|) . 
A norm bound that results from Lemma 12.21 is necessary for our error analysis. 

Lemma 2.3 Assume the hypothesis and notation of Lemma \2.1\ then the computed R from Function 
2.3 satisfies 

\\Rh<\\B\\2{l + 0{eM)). (21) 



Proof. Taking f|T7|) multiplying on the left by Q"^ and reorganizing terms, we have 

R = Q^ilrn - U U^)B + {Ip - Q)R + Q^F 

thus 

||i?||2 < ||/,n - U f/^||2||5||2 + - g^g||2||i?||2 + ll^^lb + 0{e\,) (22) 

To bound ||-R||2 in (!22l) . we first need to bound \\Im — U U'^\\2- Let U have the Q-R decomposition 



U = Z 



Ru 

0{m-t)xt 



where Z is orthogonal and Ru is upper triangular. Then 

\\It-U^U\\2 = \\It-Rj;Ru\\2<eMfiim,t,p) + Oiel,] 

and 



Since Rj^ Ru and Ru Rj} have the same eigenvalues, 

%-S S^h<eMfi{m,t,p) + 0{elj, 



Using the assumption flTT]) and the results of Lemma \2.2\ we have 

Wlm-UU^h = ma.x{\\It - Ru RuhA} 
= max{\\It - Ru RuhA} 

< max{eMfi{m,t,p),l} = l (23) 

by our assumption about fi{m,t,p) in ([2]). Thus, excepting 0{e\j) terms, 

PII2 < \\Irn-U U^\\2\\B\\2 + eMLF{m,t,p)\\B\\2 + eMLi{m,p)\\Rh. (24) 

Using ( l23l) . ( 12^ . and solving for ||-R||2 yields 

PII2 < [il + eMLF{m,t,p))/{l-eMLiim,p))]\\Bh 

= (1 + eM[LFim,t,p) + L,{m,p)] + OielMBy 

= \\BUl + 0{eM)) 

which is (EH). 



Now we introduce the function block_CGS2_step which consists of two steps of block_CGS_step. 
The first step of block_CGS2_step for given B produces Qi, -Ri, 5*1 such that 

B = USi + QiRi. (25) 

The second step takes Qi and yields Qb^R2, S2 satisfying 

Qi = US2 + QbR2. (26) 

From ([25])-([26D, it follows that 

B = USi + {QbR2 + US2)Ri = UiSi + S2R1) + Qb{R2Ri), (27) 

so 

B = USb + QbRb (28) 

where 

Sb = Si + S2R1, Rb = R2Ri- 
Function 2.4 [One step of block CGS2] 

function [Qb,Rb,Sb] = block_CGS2_step{U, B) 

[Qi,Ri,Si] = block_CGS_step{U, B); 

[Qb,R2,S2] = block_CGS_step{U,Qi); 

Sb = 5*1 + S2R1', 

Rb = R2R1', 

end hlock-CGS2-step 
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In exact arithmetic, 
Thus, we expect that 



QbRb = {Im-U U'^fB, Qb'Qb = Ip. (29) 



WU'^QBh < \\It-U^U\\l\\U^BR^% 

< \\It-U^U\\l\\Uh\\Bh\\R-s% 

< \\It-U^U\\l\\B\\2\\RB'\\2- 

Thus, as in (I6])-(I8]), ideally, f l29|) should be replaced by 

U^Qb = 0, Qb^Qb = Ip, (30) 

but fl30|) is hard to enforce if \\U'^ BR]^^\\2 is too large. When accounting for rounding error, we can 
only guarantee that ||f/"'"<5_B||2 is small if we bound ||i?||2||-R^^||2 as shown in (jH]). 
Also, we have two bounds that result from interpreting Lemma 12.31 They are 

WRih < WBUl + OisM)) (31) 
P2II2 < \\QiUl + Oi6M))<l + OieM) (32) 

which are freely used in our analysis. 
In exact arithmetic, 



An approach to developing a procedure similar to Function 2.4 is given by Strathopoulos and Wu 
[TT] . In our notation, they find Qb such that 

Range [( U B )] = Range [( U Qb )] 

that also satisfies (130|) . The focus of their paper is a procedure for locaLqr that is designed to be 
efficient in terms of storage accesses if B is "tall and thin" (i.e., m ^ p), but satisfies neither of the 
criteria well. The authors compensate by crafting a routine like that above, but using outer 

iterations to get Hf/^'^QBlh as small as possible and inner iterations to get \\Ip — Qb^Qb\\2 as small as 
possible. For our routine, the number of inner iterations is 1 and outer iterations is 2. The concern 
about "tall and thin" matrices B is alleviated by use of the "tall, skinny" Q-R (TSQR) as in [8l 
§2.3]. 

As well as repeating the two operations from block_CGS_step, there are two other operations for 
which we need error bounds. In fioating point arithmetic, the computed values of Sb and Rb from 
Function 2.5 satisfy 

Sb + 6Sb = S1 + S2R1, \\6SB\\2<EML,ip)\\B\\2, (33) 
Rb + 6Rb = R2R1, \\5RB\\2<eML5{v)\\B\\2, (34) 

where L^{-) and L^{-) are modestly growing functions. For conventional matrix multiply and add, 
Liip) = p^^^(l+p^''^) and L^{p) = p^. Using Weyl's inequality for singular values [HI Corollary 8.6.2], 
we have 

We{RB)-MR2Ri)\<\\5RB\\2<eML5{p)\\B\\2, £=!,..., p. (35) 
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In the block analog of Function 2.2, we partition A G M"*^" into 

A = (A,A2,...,^) (36) 

where Aj. G M"*^P'= for k = 1, . . . , s. In practice, ii p = In/s], then usually, Pi E {p — l,p}. In this 
input to Function 2.5, we define the parameter blocks as 

blocks = (pi, ... (37) 

Likewise, we partition Q into 

Q = {QuQ2,...,Qs). (38) 

We let 

Qk = iQuQ2,---,Qk), Ak = {Ai,A2,...,Ak) (39) 

and let 



k 



( Rii Ri2 Rik \ 

R22 R2k 

\ Rkk J 



(40) 



The initial step of factoring the first block is [Qi, -Ri]=local_qr(y4i) with Rn = Ri and Qi = Qi- 
Then for /c = 1, . . . , s — 1 we compute 5*^+1, Qk+i and Rk+i,k+i by Function 2.4 for B = Ak+i and 
U = Qk- Thus if 



Rk+l ~ Q Rk+ik+1 } ' ~ ( Qk+1 ) ! — ( ^k ) ! 

then 

^fc+l = Qk+lRk+l, k = 1, . . . , s — 1 

and, finally, A = QR, with Q = Qs and R = Rg. 
We summarize the algorithm BCGS2 as follows. 

Function 2.5 [Block Classical Gram— Schmidt with Reorthogonalization (BCGS2)] 

function [Q, R] = block_CGS2 (A, blocks) 

[m, n]=size(A); s=length(blocks); high = blocks(l); 

[(5,i?]=local_qr(y4(:, 1 : high)); 

for k = 2 : s 

low = high + 1; high = high + blocks(fc); 
[Qnew,Rnew,Snew]=^^ock_CGS2_step{Q,A{:,low : high))] 



(41) 



( r\ T3 )iQ (Q Qnew ) 1 

y U -Tlnew J 



end; 

end; hlock_CGS2 
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3 Error analysis of the algorithm BCGS2 



Our error analysis of Function 2.5 is the result of the error analysis of one step of Function 2.4 
followed by an induction argument. The details of our proof use standard error analysis assumptions 
and techniques. 

3.1 Error Bounds for Function 2.4 

To establish our error bound for Function 2.4, we establish two bounds. 
The first, on 

\\B-USb-QbRb\\2, (42) 

has no preconditions. The second, on 

WQBh (43) 
requires one of two assumptions. The first assumption is 

< eMfsing{m,t,p)\\B\\2\\RB^\\2 < 7 (44) 

where 

7 = LFim,t,p)/ fi{m,t,p) < 1, (45) 
fsingijn.t.p) = fi{m,t,p) + LF{m,t,p) +^L^{p). (46) 

The second is that 

,2\l/2 



R~,%<{l + lT (47) 



for 7 in (1451) . 
The first theorem covers (142|) . 



Theorem 3.1 Assume the hypothesis and notation of Lemma \2.1[ Then the computed Qb,Rb (^nd 
Sb from Function 2.4 satifies 

\\B-USB-QRRB\\2<eMfresid{m,t,p)\\B\\2 + 0{elj) (48) 

where 

fresidim,t,p) = 2Li{m,p) + 2L3{t,p) + L^{p) + L^ip). (49) 

Proof. Using Lemma [2.21 on the second block CGS step in Function 2.4 yields 

QbR2 = {l,n-U U^)Q^ + F2 

= Qi-US2-U{5S2) + F2 
= Qi-US2 + AY2-6Y2. 

Multiplying by R2 yields 

QBR2R1 = {Im-U U^)QlRl + F2R1 
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thus 

QbRb = {Im-U U^)QiRi + - Qb{5Rb). 



If we use the fact that F2 = AI2 — + U{5S2) from Lemma [2.21 then we have 

QbRb = QiRi - US2R1 + (Ar2 - ^Y2)Ri - Qb{SRb). 

Expanding QiRi using Lemma [2^ yields 

QbRb = {I„.-U U^)B-US2Ri + F, + {^Y2-8Y2)Ri-Qb{SRb) 

= B-USi- US2R1 - U{6Si) + Fi + (AF2 - SY2)Ri - Qb{SRb). 

Using the definition of Fi in Lemma 12.21 and the backward error for Sb in (!33l) we have 

QbRb = B-USb- U{5Sb) + AFi - 5Yi + (AF2 - 5Y2)Ri - Qb{5Rb). 

Using norm bounds yields 

\\B - usb - QBRBh < w^ssh + w^Y^h + imh + +\m\\2\\Rih + imuRih + imBh 

< SmUHp) + L,{m,p) + Ls{t,p) + L,{p)]\\B\\2 + [L,{m,p) + Ls{t,p)]\\R,\\2) + 0(6 

< £M(2Li(m,p) + 2Ls{t,p) + L,{p) + L,{p))\\B\\2 + 0(4/) 

= £M/resid("2,t,p)||5||2 + 

estabhshing (I48|)-fl49|). 
A crucial norm relationship is given by the following lemma. 

Lemma 3.1 Let R2 and Qi be result of implementing Function 2.4 in floating point arithmetic with 
machine unit Em- Then, if R2 is nonsingular, for 7 < 1 

\\U^QiR2%<l + 0{eM) (50) 

if and only if 

\\R~2%<{l + l'f'' + 0{eM). (51) 

Proof. We start with interpreting Lemma [2.11 for the second CGS step in in Function 2.4 which 
leads to 

QbR2 = iIm-U U'')Q, + F2. 

Taking the normal equations matrices of both sides yields 

R^Ql QbR2 = Ql{Im. - U U^fQi + F^{Irn - U U^)Qi + Ql{I^ - U U^)F2 + F^ F2. (52) 

An expansion of Qj{Im — U U^YQi produces 

Qj{Im-U U^fQi = QjQi-QjUU^Qi-QlU{It-U^U)U^Q, 

= I-QlUU^Qi (53) 
+ QlQi- I -QlU{h-U^U)U^Q^ (54) 
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so that the combination of fl52p and 0541) is 

RlR^ = I- QlUU^Qi + E (55) 



where 



E = El + E2 + E^, 

El = F^{Irn-UU^)Qi + Ql{Im-UU^)F2 + F^F2, 

E2 = QlQi-Ir,-QlU{It-U^U)U'^Qi, 

-^3 = R^iQ^ Qb — ip)R2- 



Since 



||£^i||2 < 2\\F2\\2+\\F2\\l<2eMLF{ni,t,p) + 0{e'' 



\m\2 < - QlQih + \\h - U^Uh + 0{el,) < eM[Liim,p) + /i(m,t,p)] + 0(4,), 
M\2 = \\Ip-QlQB\\2\\R2\\l<eMLi{m,p) + 0{elj), 



we have 

\\E\\2 < ||El||2 + p2||2+||^3||2 

< eM[fi{m,t,p) + 2LF{m,t,p) + 2Li{m,p)] + 0{elj). 

Now to show the equivalence between fl50|) and fl5T]) . Since we assume that R2 is nonsingular, we 
can rewrite (l55l) as 

Ip = i?2 ^-^2 ^ — -^2 ^ QlUU^ Q1R2 ^ + -R2 ^ ER2 ^ 

SO that 

= Jp + R-^QiUU^QiR2^ - R2^ER2\ (56) 
If Ai(-) is the leading eigenvalue of the contents, then fl56|) is given by 

Ai(i?^^/22-i) = 1 + \i{R2^QiUU^QiR2^) + ^\\R2'\\l (57) 

where 

lei < Iii?ii2. (58) 

Using the relationship, Ai(C'^C) = ||C||2 on ([57D yields 

||i?2^i^(l-0 = l+||f/^Qli?2~l2- (59) 

Thus assuming (150!) yields 

Il^2"l2 = [1 + (7 + 0(^m))']/(1 -0 = 1+7' + C?(^A/) 

establishing (l5T|l . Likewise, a similar algebraic manipulation of (159|) shows that (!5T!) implies (150|) . 
Before showing the effect of assumption (jS]), we need a small technical lemma. 
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Lemma 3.2 Assume f l44p and assume the hypothesis and notation of Lemma 2.1. Then Function 
2.4 produces nonsingular Ri and R2. 

Proof. Since the smallest singular value of Rb satisfies 

O'piRs) = ||-Rb^||2^ 

assumption (jH]) may be written 

'japiRs) > eMfsingim,t,p)\\B\\2 > 0. 

/^From ([35D, 
thus 

Q<eM[h{m,t,p) + LF{m,t,p)]\\B\\2<i<yp{R2Ri). (60) 
Thus -R2-R1 is nonsingular. Since -Ri and R2 are square, Ri and R2 are each nonsingular. 
We now show the effect of the assumption 



Lemma 3.3 Assume (jSj) and assume the hypothesis and notation of Lemma \2.1[ Then Function 
2.4 produces Qi and R2 such that fl50|) and flSTj) hold. 

Proof. From Lemma [3. ![ (!50l) and (!5T!) are equivalent so we need only prove (1501) . 
Interpreting Lemma [2.11 and using the result of Lemma [3.21 for the first CGS step in Function 2.4 
yields 

U^QiR-^ = U^{Im - U U^)BR2^R^^ + U'^FiR'^^R^^ 
= [{h-U^U)U^B + U'^F,]{R2Rir\ 

Thus 

\\U^QiR2% < [\\It-U^U\\2\\U^B\\2 + \\U^F42]\\{R2Rir'\\2 

< eM[fiim,t,p) + LF{m,t,p)]\\B\\2UR2Rir'\\2 + Oiel). (61) 

Combining ([60il and ([6T|) yields 

\\U^QiR2'h < 7^^p(i?2i?i)||(i?2i?i)~l2 + 0{el,) = 7 + 0(4/) 

which satisfies f[50|) . 
Now we have a conditional bound on ||f/^(5B||2 from Function [2J 

Theorem 3.2 Assume the hypothesis and notation of Lemma \2.1[ Assume also that U satisfies f[TT]) 
and either (jH]) or fj47l) holds. Then Function 2.4 produces Qb that satisfies 

WU^Qsh < SmII + V2]LFim,t,p) + Oiel). (62) 
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Proof. Applying Lemma [2.21 to the second block_CGS_step of Function 2.4 yields 
QbR2 = {Im-U U^)Qi + F2, IIF2II2 < eMLFim,t,p) + 0(4,). 

Thus 

U^Qb = {It - U^U)U^QiR^' + U^'F^R^'. 

Norm bounds lead to 

WW^Qsh < \\It - Uh\\U^QiR2% + \\F2h\\R2%- 
From Lemmas and 13.11 and 13. 3[ either assumption (144|) or ( 1471) yields (l50i) -(l5Tl) , thus 

WU^Qsh < iWh - U^Uh + (1 + l^Y'^F^h + 0{el,). (63) 
which from (145|) becomes 

WU^Qsh < eM[lfiim,t,p) + {l+^Y^LFim,t,p)] + Oiel,) 

< £M[(l + (l + 7')'/')^F(m,t,p)] + 0(e^) (64) 

Using 7 < 1 produces (j62l) . 

3.2 Error Bounds for Function 2.5 

Obtaining the bounds ([2]) and ([3]) are simply the result of induction arguments on Theorems 13.11 and 



In the arguments of this section, we assume that all of the blocks Ai,. . . ,As have the same di- 
mension, i.e., pi = ■ ■ ■ = ps = p. To have blocks of differing size, we could just assume that 
p = maxi<j<s Pi and make some other minor adjustments to the proofs in this section. 

We begin with ([3]) and let tk = kp. 

Theorem 3.3 Assume the hypothesis and notation of Lemma \2.1\ Let in (!39|) and in ( 140|) 
be the result of k steps of Function 2.5. Then for k = 1, . . . , s, Ak in ( l39l) satisfies 



\\Ak-QkRkh < eMf2{m,tk-up)\\Ak\\2 + 0{el,) (65) 

where 

f2{m,tk-i,p) = k^^'^fresid{m,tk-i,p). (66) 
Thus (E]) follows from (l65l) -(l66l) by taking k = s. 

Proof. For k = 1, Ai = Ai Qi = Qi and Ri = Ru, by our assumption ([5]) 

\\Ai-QiRi\\2 < SMLi{m,p)\\Ai\\2 + 0{elj) 

< eMfresidim,0,p)\\A\\2 + 0{elj) 

< eMf2{m,0,p)\\A\\2 + O{el,). 
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For the induction, step assume that the theorem holds up to step k and prove for step k + 1. For 
k < s, we have that 

ife+i - Qk+iik + l)p={ Au Ak+i )-{Qk Qk+i ) ( i?f+tfc+i ) 
where Sk+i e M*^''^ and Rk+i,k+i ^ Thus, 

ll^fc+l — Qk+lRk+lWl ^ W^k — QkRkWl + ll^ifc+l ~ QkSk+l — Qk+lRk-^l,k+l\\2- (67) 

The first term is bounded by the induction hypothesis, the second results from applying Theorem 

Oto Afc+i, Qk, Sk+i, Qk+i and i?fc+i,/t+i which gives us 



WAk-QkRkh < SMk ' fresid{m,tk-i,p)\\Ak\\2 + 0{eM) 

W^k+l — QkSk+1 — Qk+lRk+l,k+l\\2 < £ M f residing', tk, p)\\Ak\\2 + 0{e\f) 

Combining dSID, ([SSD, and ([SHD yields 

\\Ak+i - Qk+iRk+i\\l < elAkf^,s^di^,tk.^,p)\\Ak\\l + fL^di^^tk,p)\\Ak+l\\l] + 0(4/)- (68) 

Since fresidi.-) is monotone nondecreasing is all of its arguments and ||Afc||2, ||^/t+i||2 < ll^fc+ilh, 
(p5]) becomes 

||ifc+i -4+i^fc+i|l2 < ik + l)eljfLidi^^tk,p)\\Ak+i\\l + Oiel,). 

Taking square roots establishes the induction step of the argument. 
To prove the orthogonality bound ([2]), we need to make define 

def LF{m,tk-l,p) . . 

Ik = -Ti — T k = 2,...,n (69) 

/i(m,tfc_i,p) 

= ia' 



{k-l) + l) '\ a = Y7 + 4v^^ 3.56. (70) 

Let ^ be the upper triangular matrix produce in the second call to Function 2.3 in the kth step 
of Function 2.5. Our generalizations of assumptions (l44l) and (l47l) to Function 2.5 are 

fsing{'m,tk-l,p)\\Ak\\2\\R'kl\\2 < Ik (71) 

\\[Ri'Y%<{l+llY^\k = 2,...,n (72) 

where fsing{'nT',t,p) is defined by ( l46l) . Using these two assumptions, we have our final theorem. 

Theorem 3.4 Assume the hypothesis and notation of Theorem \3.3\ and that either assumption f lTTj) 
or ( !72|) holds. Then for k = 1, . . . ,n, 



\\It, - QlQkh < eMfiim,tk-i,p) + 0(4/) (73) 
where /i(-),7fc and a satisfy fl69|) - flTOj) . Interpreting f l73|) for k = s yields ([2]). 
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Proof. This is a proof by induction on Theorem 13.21 For = 1, we note that Qi = Qi which just 
results from [Qi,Rii] = local_qr(y4i). Thus 

Wip-QiQih = \\ip-Q^Qih 

For the induction step, assume tk < n that ( I73|) holds for k. Then 



hk+i)p — Qk+1 Qk+i 

so that 

||-^(fc+i)p - Qk+i Qk+ih 



~ Qk Qk QkQk+i 
Qk+iQk ip " Qk+i Qk+i 



T ri. IL _ 11/ ^tk~Q\Qk QkQk+1 

Qk+iQk ip — Qk+i Qk+i 

\^tk—QlQk\\2 WQkQk+ih 
WQl+iQkh \\Ip - Qk+iQk+ih 



< 



Invoking the induction hypothesis, applying Theorem 13.21 to Qk and Qk+i, and using the assumption 
(SD yields 



llr n ll ^ ^ W I fiim,tk,p) {1 + V2)LF{m,tk,p) , 

\\I(k+i)p~ Qk+iQk+ih < Sm\\\ r- / , X r /X ||2 + C;(£: 



-A/) 



(1 + A/2)LF(m,tfc,p) Li{m,p) 

/ , II / fi{m,tk-up) {1 + V^)LF{m,tk,p) \ II ,xo/,2x 
- (l + V2)Mm,4,p) L,{m,p) ) + ^(^m) 

< SMCorthijn.tk.p) + 0{e\j) (74) 

where using the implicit definition of /i(-) the definitions of 7fc and a in (l69l)- (j70l) . and of Lp{-) in 
f l20l) . we have 

Corth{m,tk,p) = (^/i(m,tfc-i,p) + 2((1 + V2)^L|(m,tfc,p) + Li(m,p)j 
We can bound Corth{-) by 

Corth{m,tk,p) < f /i (m,tfc-i,p) + 2((1 + V2YLp{m,tk,p) + Lp{m,tk,p)] 
< (^/2(m,4_i,p) + (7 + 4v^)L|(m,4,p)j 

= {fl{rn,tk-i,p) + a^L],{m,tk,p)) 
= {%'^Lp{m,tk-i,p) + a^L],{m,tk,p)y^ . 
Since Lp{-) is nondecreasing in all of its arguments 

Corth{'m,tk,p) < {%'^ + a^Y^"^ Lpim.tk.p) 

= %liLF{m,tk,p) = fi{m,tk,p) (75) 

Combining fl7i|l - (175|) yields the induction step for fl73l) . 
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3.3 Interpreting the Bounds for Function 2.5 

Function 2.2 is just Function 2.5 with p = 1 and with locaLqr producing Rb = (rb) and Qb = (qt) 
from B = (b) from the normahzation 

n = ||b||2; qb = b/rfe. 

When p = 1, tk = k and we will interpret it as such. 
In floating point arithmetic, the computed values satisfy 

n = \\hUl + 5), \6\ < (m/2 + l)eM + 0{el) (76) 

and 

cib = {Im + D)h/rb, D = diag{d.i}, ll^b < ^m- (77) 
From (I76])-([77]), it follows that 

||b - qtrbh < ^A.f||b||2 

and 

|1 - qt^qtl < (m + A)eM + 0{eli) 

which are (jl])-© with Li{m, 1) = max{m + 4, 1} = m + 4. 
The other operations of a CGS step are 



s + 6s = U^h 



where 

Thus, L2{'m, k, 1) = mk^/"^ . 
We also have 

where 



m2<eMmk^'^\\hh + 0{elj) 



y + (5y = b — f/s 



\m2<eM{l + k''^)\\h\\, + 0{el,). 
Thus L3(fc, 1) = 1 + k^/'^ . Thus one step is 

qr = {Im-U U^)h + f 

where 

\\ih<eMLF{m,k,l)\\h\\2 + 0{el,) 

and 

Lir(m, k, 1) = Li(m, 1) + L2(m, A:, 1) + L-i{k, 1) = m + A;^/^(m + A;) + 4. 
The two operations 

sb + 5sb = si + S2ri 
Tb + SrB = r2ri 
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satisfy 



ll^s^lh < 2eM\\h\\2 + 0{el,) 
\5rB\ < £A./||b||2 + 0(4/) 



so that L^il) = 2 and ^^5(1) = 1. Thus 

fres^d{m, k, 1) = 2(m + 4) + 2(1 + P/^) + 3 = 2m + 2^^^ + 13. 
Taking t = s = n, we have that Function 2.2 obtains a Q-R factorization satisfying 



A-QRh < eMn^^^fresid{m,n,l)\\A\\2 + 0{e 



M) 



< eM{2mn^'^ + 2n' + l?,n^'^)\\A\\2 + 0(4/)- 

The condition for near orthogonahty of Q has a nice interpretation. Assumption (17T!) may be 
written 

eMfsing{'m,k,l)\\a.k\\2 < lk\rkk\ (78) 



where again, 
and 



7fc = L^(m, fc, l)//i(m, fc, 1) = (^^(^ - 1) + l) 



-1/2 



Assumption f l72|) is 



fsing{m,k,l) = fi{m,k,l) + LF{m,k,l) + L^^l) 

= [(a2(/e- l) + l)'/' + l]L^(m,A;,l) + 2 
= [{a\k - 1) + 1)'/' + l](m + + ^) + 4) + 2. 

l45l>l/(l+7D'''- (79) 
(2) 

where r^^ is the diagonal element in the second Gram-Schmidt step at step k of Function 2.2. 
Either assumption leads to the bound 

||/ - Q^Qh < eMfi{m,n, 1) + 0(4,) 

where 

/i(m,n, 1) = 7^^Li.(m,n, 1). 

Notice that (1781) is merely an assumption that each of the diagonals of R is sufficiently bounded 
away from £:M||a/fc||2- There is no assumption on the condition number of R (or A) and (ITHl) much 
weaker than the assumption given by Giraud et al. [5] for Function 2.2. The second assumption, 

rOj) . is very similar to an assumption discussed by Abdelmalek [1]. 
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4 Conclusions 



Function 2.5 is a new block classical Gram-Schmidt Q-R factorization with reorthogonalization. We 
have shown that as long as the diagonal blocks on R do not become too ill-conditioned, the factor- 
ization produces a near orthogonal Q according to the criterion ([2]) and a small residual according 
to the criterion ([3]). 

Moreover, if we consider the block size 1, we have improved a bound of Giraud et al. [5] for Function 
2.2 showing that a near left orthogonal Q is produced if diagonals of R are bounded sufficiently away 
from zero. 

References 

[I] N.I. Abdelmalek. Roundoff error analysis for Gram-Schmidt method and solution of linear least 
squares problems. BIT, ll(4):354-367, 1971. 

[2] A. Bjorck. Numerics of Gram-Schmidt orthogonalization. Linear Algebra and Its Applications, 
197-198:297-316, 1994. 

[3] A. Bjorck. Numerical Methods for Least Squares Problems. SIAM Publications, Philadelphia, PA, 
1996. 

[4] J.J. Dongarra, J.J. DuCroz, I.S. Duff, and S.J. Hammarling. A set of level 3 basic linear algebra 
subprograms. ACM Trans. Math. Software, 16:1-17, 1990. 

[5] L. Giraud, J. Langou, M. Rozloznik, and J. Van Den Eshof. Rounding error analysis of the 
classical Gram-Schmidt orthogonalization process. Numerische Mathematik, 101(1):87-100, 2005. 

[6] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins Press, Baltimore, 
MD, third edition, 1996. 

[7] N.J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM Publications, Philadelphia, 
PA, second edition, 2002. 

[8] M.F. Hoemmen. Communication-avoiding Krylov subspace methods. PhD thesis. University of 
California, Berkeley, CA, USA, 2010. 

[9] R.A. Horn and C.A. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, UK, 
1985. 

[10] W. Jalby and B. Phillippe. Stability analysis and improvement of the block Gram-Schmidt 
algorithm. SIAM J. Set. Stat. Comput, 12:1058-1073, 1991. 

[II] A. Stathopoulos and K. Wu. A block orthogonalization procedure with constant synchronization 
requirements. SIAM J. Set. Comput, 23(6):2165-2182, 2002. 

[12] G. W. Stewart. Block Gram-Schmidt orthogonalization. SIAM J. Sci. Comput., 31(l):761-775, 
2008. 



18 



[13] D. Vanderstaeten. An accurate parallel block Gram-Schmidt algorithm without reorthogonal- 
ization. Numer. Linear Algebra AppL, 7:219-236, 2000. 



19 



